ABSTRACT: THE IT INDUSTRY HAS SEEN A SERIES OF REVOLUTIONS,
MIGRATING FROM STANDARDIZATION TO INTEGRATION TO
VIRTUALIZATION TO AUTOMATION TO THE CLOUD. NOW THE
INDUSTRY IS SET TO REVOLVE AROUND THE COMMERCIALIZATION
OF DATA ANALYTICS AND BUSINESS INTELLIGENCE. DATA IS BEING
GENERATED IN EVERY FIELD AND EVERY INDUSTRY SECTOR, SO THE
VOLUME, VARIETY AND VELOCITY OF DATA HAVE BECOME
EXTREMELY HIGH. TRADITIONAL DATABASES CANNOT HANDLE SUCH
ENORMOUS DATA: THE PROBLEMS OF STORAGE, COMPUTATION, LOW
NETWORK BANDWIDTH AND POOR FAULT TOLERANCE LED TO THE
INTRODUCTION OF BIG DATA. IN THIS PAPER WE FOCUS ON THE
BACKEND ARCHITECTURE AND WORKING OF THE PARTS OF THE
HADOOP FRAMEWORK: MAPREDUCE FOR THE COMPUTATION AND
ANALYTICS SECTION AND THE HADOOP DISTRIBUTED FILE SYSTEM
(HDFS) FOR THE STORAGE SECTION.
I. Introduction

With the industrial revolution of data, a tremendous amount of
data is being generated. With the emergence of companies, data
that was once confined to a few gigabytes has gone past
petabytes into zettabytes. Technology is now so widely used
that we can infer human behaviour through the analysis and
prediction of the data generated. Data is generated by machine
sensors, GPS, billing and transactions. New data sources have
emerged so quickly that storage capabilities have fallen
short. Traditional data warehouses are limited to the RDBMS
concept, which handles mostly structured data. In an era when
data is generated from all directions, flexible stores for
unstructured data, the NoSQL databases, are the new favourite
of the industry. The amount of unstructured data generated can
be gauged from these figures: every month 1 lakh (100,000) new
users register on Facebook, 5 billion mobile phones were in
use in 2010, and 30 billion new pieces of content are created
or shared on Facebook. "Big data" refers to datasets whose
size is beyond the ability of typical database software tools
to capture, store, manage, and analyse. The industry is now
working to make sense of these figures through the analysis
and prediction of different parameters.
Data warehouses are also an important part of analytics. Big
data techniques can be applied to both structured and
unstructured data, that is, on both analytical DBMSs and NoSQL
databases. Big data has proved to be an asset when analysing
data in motion (stream processing). Most of the larger firms
generate huge amounts of data, and with the arrival of cloud
models that incorporate sound data storage, companies are
processing enormous volumes of it. This data is not only a
hardware storage problem; it also raises file system design,
implementation, I/O processing and scalability issues. Storage
capacity has improved significantly to meet the needs of the
data generated, but HDD access speed has not improved nearly
as much. The first main problem is therefore where to store
this enormous data: the storage capacity problem. The second
is making business sense of it through analytics, which is the
computation problem. Other important factors are network
bandwidth and reliability. Reliability refers to how the
system responds when an unfavourable condition materializes
that could lead to the loss of important data and, in turn, to
flawed analysis. A backup of the stored data should therefore
always be available to cope with the risk of data loss. The
other major concern is network bandwidth.

Thus storage, computation, reliability and bandwidth issues
are some of the big data problems that the modern IT industry
is facing. The Hadoop framework is well suited to addressing
them, and it offers additional features that could turn out to
be an asset for the industry. In this paper we discuss in
detail how the Hadoop framework can help us meet the
challenges discussed above.

Architecture and Functioning 

MapReduce: The analysis part of the Hadoop framework is
managed by the MapReduce framework, a programming model
developed by Google. It works on the principle of divide,
sort, merge and join, and was built for batch processing and
parallel processing. It is a natural fit for ad-hoc queries,
web search indexing and log processing. From a business
perspective, the main objective of MapReduce is deep data
analytics, in which predictions are made by observing
patterns. To analyse large unstructured datasets it uses two
functions, the "Mappers" and the "Reducers".
Both the "Mappers" and the "Reducers" are user-defined
functions. The model is based on parallel programming: the
datasets are processed in parallel on the different nodes of
the cluster. Map and reduce functions are available in
languages such as LISP. Apart from the map and reduce
functions, the framework also comprises the partitioner and
the combiner functions. Users of MapReduce can specify the
number of reducer tasks they want, and the partitioning
function divides the data among these tasks. The combiner
function is executed on every node that performs a map
function; it merges the local disk data before it is moved
over the network.

The mechanism of MapReduce is mainly divide and conquer. The
main program is started with a dataset as input and, according
to the job configuration, the master program initiates the
various nodes for map and reduce purposes. The input reader is
then initialized to stream the data from the dataset; it
breaks the file into many smaller blocks and assigns the
blocks to the nodes designated as mapper nodes. As stated
above, the map and reduce functions are user defined, so the
user's map function is executed on the mapper nodes and
generates {key, value} pairs. The results generated by the
mappers are not simply written to disk; some sorting is done
first for efficiency. Each map task has a circular memory
buffer in which it stores its output. By default the buffer's
capacity is 100 MB, and this size can be changed. When the
contents reach a threshold of 80% of the buffer, a background
thread starts to spill them to disk. If the buffer fills up
while the spill is in progress, the map task blocks until the
spill is complete. Before the data is written to disk, the
generated pairs are sorted. At this point the already
initiated "Reducer" nodes come into action.
All the sorted data is sent to the reducer nodes by the
partitioner function. Each reducer collects the items that
share a key, applies the user-supplied reduce function, and
aggregates the result as a collective entity. The partitioner
and combiner functions are applied to the sorted output so
that less data has to be written to disk. The result is
collected by the output writer, and the parallel processing
terminates.

The MapReduce architecture consists of a Jobtracker and
multiple Tasktrackers. The Jobtracker acts as the master and
the Tasktrackers act as the slaves. The Jobtracker sits on the
Namenode and the Tasktrackers sit on the corresponding
Datanodes. When a job is submitted to the Namenode, the
Jobtracker is informed about the input; via the heartbeat
protocol it checks for free slots on the Tasktrackers and
assigns map tasks to the free ones. Map tasks read data from
the splits using the record reader and input format, invoke
the map function, and generate key-value pairs in the memory
buffer. Once the map tasks are done, the memory buffer is
flushed to the local disk of the map node as an index and the
key-value pairs. The map nodes report to the Jobtracker, and
the Jobtracker notifies the reduce task nodes of the cluster
to begin the next step, the reduce task. The reduce nodes
concerned download the files (index and key-value pairs) from
the respective map nodes, read them, and invoke the
user-defined reduce function, which produces the aggregated
key-value pairs. Each reduce task is single threaded. The
output of each reduce task is written to a temporary file in
HDFS. When all reduce tasks are finished, the temporary file
is automatically renamed to its final file name.
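
The flow described above (map, combine, partition, sort,
reduce) can be sketched in a single process. This is only an
illustration under stated assumptions, not Hadoop's API: the
word-count task, the function names (map_fn, combine,
partition, reduce_fn, run_job) and the CRC32-based partitioner
are all hypothetical choices made for the example.

```python
from collections import defaultdict
from zlib import crc32


def map_fn(_offset, line):
    """User-defined mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reducer: aggregate all values that share a key."""
    return key, sum(values)


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())  # map output is sorted by key


def partition(key, num_reducers):
    """Partitioner: deterministic hash so equal keys reach one reducer."""
    return crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # Map phase: one "mapper" per input split, followed by its combiner.
    mapper_outputs = []
    for split in splits:
        pairs = [kv for offset, line in enumerate(split)
                 for kv in map_fn(offset, line)]
        mapper_outputs.append(combine(pairs))

    # Shuffle: route each pair to the reducer chosen by the partitioner.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each reducer processes its keys in sorted order.
    result = {}
    for grouped in reducer_inputs:
        for key in sorted(grouped):
            word, total = reduce_fn(key, grouped[key])
            result[word] = total
    return result


splits = [["big data big cluster"], ["data node", "big node"]]
counts = run_job(splits)
print(counts)
```

Changing num_reducers changes which reducer handles each key
but not the final counts, which is the property the
partitioning function is meant to guarantee.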

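The spill behaviour of the map-side buffer can be illustrated
with a little arithmetic: with a 100 MB buffer and an 80%
threshold, spilling starts once 80 MB of output is buffered.
The sketch below is a simplification, assuming records with
known sizes and a synchronous spill instead of a background
thread; simulate_spills and its parameters are hypothetical
names, not Hadoop configuration keys.

```python
BUFFER_MB = 100         # default buffer capacity stated in the text
THRESHOLD_PERCENT = 80  # spilling starts when the buffer is 80% full


def simulate_spills(records, buffer_mb=BUFFER_MB,
                    threshold_percent=THRESHOLD_PERCENT):
    """Return the spill files produced for (key, size_mb) records.

    Each spill file is the buffered records sorted by key, mirroring
    the sort that happens before map output is written to disk.
    """
    limit_mb = buffer_mb * threshold_percent / 100
    buffered, used_mb, spills = [], 0, []
    for key, size_mb in records:
        buffered.append((key, size_mb))
        used_mb += size_mb
        if used_mb >= limit_mb:
            spills.append(sorted(buffered))  # sort, then "write to disk"
            buffered, used_mb = [], 0
    if buffered:
        spills.append(sorted(buffered))      # final flush at end of task
    return spills


# Twenty records of 10 MB each: a spill after every 8 records (80 MB).
records = [("k%02d" % (i % 7), 10) for i in range(20)]
spills = simulate_spills(records)
print([len(spill) for spill in spills])
```

In a real map task the spill runs concurrently with the
mapper, which is why the remaining 20% of the buffer matters:
it lets the map function keep writing while the spill is in
progress.
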
HDFS: In traditional disks, the maximum amount of data that
could be stored or read in one block was 512 bytes; later,
file system blocks arrived that could hold a few kilobytes.
With the current volume of data it is next to impossible to
store or analyse terabytes or zettabytes of data over a
distributed network using such traditional systems. The Hadoop
Distributed File System is Hadoop's data storage framework,
implemented on commodity hardware. HDFS blocks are far larger,
typically 64-128 MB. Working with blocks is simple in HDFS;
for example, replication is done at block level rather than
file level. HDFS was created with MapReduce in mind.

(ISSN : 2277-1581)
1 Jan 2014

HDFS represents a distributed file system that is designed to
store enormously large datasets while providing
high-throughput access to them. An HDFS cluster contains many
racks holding thousands of servers, each of which is a node of
the cluster, so the probability of hardware failure is very
high. The Hadoop design should therefore be fault tolerant and
offer high throughput for data streaming.
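
To make the block sizes concrete, the following sketch shows
how a file would be cut into fixed-size blocks. It is an
illustration only: split_into_blocks is a hypothetical helper,
and the 64 MB default is simply the lower end of the range
quoted above.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Return (offset, length) pairs, one per block of the file.

    Every block is full-sized except possibly the last, which only
    holds the remaining bytes.
    """
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(200 * MB)
for offset, length in blocks:
    print(offset // MB, length // MB)
```

A 200 MB file therefore occupies three full 64 MB blocks plus
one 8 MB block; with 128 MB blocks the same file would need
only two.
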
Characteristics or goals of HDFS: 

1. High fault tolerance. 

2. Moving computation is better than moving data. 

3. Ability to handle large datasets.

4. Cross platform compatibility. 

5. High throughput for streaming data. 

Architecture: HDFS runs on the GNU/Linux operating system and
is built in Java. It works on the principle of a master/slave
architecture. It consists of a Namenode, which is unique for
the whole cluster, and a secondary Namenode, which acts as a
checkpoint. All the remaining nodes of the cluster are
Datanodes, which act as the slaves. The Namenode acts as the
master, instructing the Datanodes to perform operations. When
a large dataset is to be stored in HDFS, the file is split
into numerous blocks. Blocks of the same file can reside on
different nodes of the cluster, and each block is stored as a
file on the Datanode's local file system. HDFS maintains a
single namespace for the distributed file system, and this
namespace is maintained in the Namenode. Since the blocks are
distributed over the cluster and the Datanodes store them in
their local file systems, the file system tree, with the
metadata of the files and directories in it, is also
maintained in the Namenode. This information is dynamic in
nature. The Namenode keeps it in two files, the FsImage and
the edit log. The FsImage stores the block-to-file mapping and
the file system properties, whereas the edit log records all
the changes made to the file system; every modification to a
block is written to the edit log.

For HDFS to work properly, the most important component is the
master, the Namenode. If the Namenode fails, HDFS becomes
unusable, so it must stay functional throughout; a failure
could cause huge data loss, since HDFS holds massive datasets.
Although we cannot fully prevent Namenode failure, we can
minimize its effect by keeping checkpoints. The secondary
Namenode serves this purpose: it periodically merges the
FsImage and the edit log. When the metadata from the Namenode
is stored on the local disk, it is also written to N mount
points as a backup. The CPU-intensive merge activity runs on a
separate system. If the Namenode fails at any moment, the
FsImage is picked up from one of the mount points and used to
run a primary Namenode. This is why the secondary Namenode is
vital.

The working of HDFS is kept simple and dynamic. When the
system starts, it is in a neutral state, waiting for the
Datanodes to send information about their vacant blocks so
that the Namenode can assign blocks to them. The Namenode
receives these messages through the heartbeat protocol and
block reports, and based on them it allots the different data
chunks to the different Datanodes. If the Namenode fails, the
secondary Namenode acts as the Namenode, as discussed earlier.
After this the Namenode works on block replication: if any
block has fewer replicas than the replication factor, it keeps
creating replicas until the factor is met. As the Namenode
boots, the FsImage and the edit log are read from the local
disk and all the edit log transactions are applied to the
existing FsImage, creating a new FsImage file; the old edit
log is then flushed. That is how the metadata stays dynamic.
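
The checkpoint step, in which edit log transactions are
applied to the FsImage to produce a new FsImage and an empty
edit log, can be sketched as follows. The in-memory
representation (a dict from path to block ids) and the three
operation names are assumptions made for the illustration;
they are not the on-disk formats HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one edit-log transaction to a {path: [block ids]} namespace."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the FsImage.

    Returns the new FsImage and a fresh, empty edit log; the old
    FsImage is left untouched so it can serve as a backup.
    """
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("add_block", "/logs/a.txt", "blk_3"),
]
new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage, new_edit_log)
```

Because the merge only replays transactions in order, it can
run on a separate machine, which matches the role the text
assigns to the secondary Namenode.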

Reliability: An exceptional quality of the Hadoop framework is
that when an input file is stored in HDFS, the large dataset
is split into smaller chunks, and these blocks of data are
replicated over different nodes of the cluster. Replication is
done at the Datanode level. The replication factor is the
number of replicas kept of the same block. This provides fault
tolerance: for example, if a rack fails, all the nodes on that
rack fail with it, but because of replication the same data
block exists on other nodes, so the required block can still
be accessed. Replicas are never placed on the same node,
because such a backup is useless: if the node fails, its
backup is gone too. Hadoop therefore spreads replicas across
different nodes of the cluster. In addition, the secondary
Namenode, which backs up the primary Namenode as discussed
earlier, carries a copy of the FsImage and edit log so that it
can act as the primary Namenode if the main Namenode fails.
This ensures the reliability of the Hadoop framework.
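
The rack-failure example can be worked through in a few lines.
This is a sketch under assumed data: the four Datanodes, two
racks and two blocks are invented for illustration, and
live_replicas and missing_replicas are hypothetical helpers
rather than Namenode functions.

```python
def live_replicas(block_locations, failed_nodes):
    """Drop the replicas that live on failed nodes."""
    return {
        block: [node for node in nodes if node not in failed_nodes]
        for block, nodes in block_locations.items()
    }


def missing_replicas(block_locations, failed_nodes, replication_factor=3):
    """Map each under-replicated block to the number of replicas to re-create."""
    missing = {}
    for block, nodes in live_replicas(block_locations, failed_nodes).items():
        if len(nodes) < replication_factor:
            missing[block] = replication_factor - len(nodes)
    return missing


node_to_rack = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
block_locations = {
    "blk_1": ["dn1", "dn3", "dn4"],
    "blk_2": ["dn1", "dn2", "dn3"],
}

# Simulate the loss of rack1: every node on that rack fails together.
failed = {node for node, rack in node_to_rack.items() if rack == "rack1"}
survivors = live_replicas(block_locations, failed)
todo = missing_replicas(block_locations, failed)
print(survivors, todo)
```

Both blocks remain readable after the rack is lost because
each has at least one replica on the other rack; the second
result is the re-replication work needed to restore the
replication factor.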

High performance: Another concern in a distributed network is
network bandwidth. Hadoop addresses the bandwidth constraint
too, because it works mostly with local data. Consider
replication: with a replication factor of 3 (the most common
case), Hadoop saves three replicas of each block on the nodes.
Storing each replica on a node of a different rack would
enhance data reliability and availability, but what about the
average network bandwidth used when a block is read or
written, since each copy would then have to be reached on a
different rack? For this reason another strategy is used: two
thirds of the replicas are placed on the same rack and the
remaining third on a node of a different rack. This improves
performance and availability, shortens access time, and
minimizes bandwidth consumption.
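
A rack-aware placement of this kind can be sketched as below.
The text only states that two replicas share a rack and the
third sits on another rack; which rack holds the pair is an
assumption here (the writer's node plus two nodes on one
remote rack), and place_replicas and the sample topology are
hypothetical.

```python
def place_replicas(writer_node, node_to_rack):
    """Choose three Datanodes so that the replicas span exactly two racks.

    Assumed policy: the first replica stays on the writer's node, the
    other two go to two different nodes of a single remote rack.
    """
    local_rack = node_to_rack[writer_node]

    # Group the remote nodes by rack, in a deterministic order.
    remote_by_rack = {}
    for node in sorted(node_to_rack):
        rack = node_to_rack[node]
        if rack != local_rack:
            remote_by_rack.setdefault(rack, []).append(node)

    # Pick the first remote rack that can host two replicas.
    for rack in sorted(remote_by_rack):
        nodes = remote_by_rack[rack]
        if len(nodes) >= 2:
            return [writer_node, nodes[0], nodes[1]]
    raise ValueError("need a remote rack with at least two nodes")


topology = {
    "dn1": "rack1", "dn2": "rack1",
    "dn3": "rack2", "dn4": "rack2",
    "dn5": "rack3",
}
replicas = place_replicas("dn1", topology)
racks_used = [topology[node] for node in replicas]
print(replicas, racks_used)
```

Only one copy of the block crosses between racks during the
write, yet the data still survives the loss of either rack,
which is the trade-off the paragraph above describes.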

Diagrams/result 

The Hadoop framework consists of two main frameworks,
MapReduce and the Hadoop Distributed File System, which are
interlinked. MapReduce is mainly for the compute or analysis
part, which is the heart of big data analysis, whereas HDFS is
mainly for the storage part. The two are highly dependent on
each other, and a master/slave architecture runs through both.
The input file is divided into multiple blocks that are saved
on different nodes (Datanodes). The replicas of these blocks,
kept to increase reliability in case of any failure, are
placed on the same or different racks with minimization of
network bandwidth usage in mind. The Jobtracker sits on the
Namenode. The input file is sent to the Namenode, which
divides it, and the file blocks are saved on the Datanodes;
this is the storage section. For any read or write operation,
the Jobtracker (master) on the Namenode asks the Tasktrackers
to run the mapper and reducer tasks respectively; this is the
computation part of Hadoop.
Conclusion

Most industry-generated data is unstructured. Even when it is
structured, it is so huge that traditional RDBMSs fail to
store data of such enormous variety, volume and velocity. The
Hadoop framework is an asset because it helps achieve the main
goals of the industry: storage, computation and analysis,
reliability and fault tolerance, and, last but not least,
efficient use of network bandwidth. Thus, using Hadoop, we can
store data in a distributed manner with HDFS and compute over
it according to user-defined functions in MapReduce.